Tidy data

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • Introduction
  • Messy data
  • Missing data
  • Separate data
  • Codes and codebooks

Warm up

R learning curve

Warm up

Warm up

Install packages

install.packages(c("dplyr", "ggplot2", "readr", "tidyr", 
                   "countrycode", "janitor", "readxl", 
                   "wbstats", "naniar", "haven", "Hmisc"))

Warm up

Load packages.

library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(countrycode)
library(janitor)
library(readxl)
library(wbstats)
library(naniar)
library(haven)
library(Hmisc)

Introduction

Introduction

“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.

Source: R for Data Science

Introduction

A df is tidy if it fulfills these requirements (Wickham 2014):

  • Each df has one unit of observation.
  • Observations represented in the rows.
  • Variables represented in the columns.
  • Each cell indicates a value.

Source: R for Data Science

Introduction

But …

Votes to parties in the 2019 Spanish elections

# A tibble: 15 × 9
   provincia municipio           votos…¹    pp    cs podem…² mas_p…³    eb pacma
   <chr>     <chr>                 <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>
 1 Almería   Abla                    710   193    47      30       0     2     3
 2 Almería   Abrucena                699   111    45      42       0     0     2
 3 Almería   Adra                  10941  3126   574     645       0    16    81
 4 Almería   Albánchez               272    56    41      18       0     0     2
 5 Almería   Alboloduy               419   155    17      10       0     0     2
 6 Almería   Albox                  4335  1261   378     247       0     2    28
 7 Almería   Alcolea                 509    92    39      43       0     1    11
 8 Almería   Alcóntar                371   125    20      15       0     2     3
 9 Almería   Alcudia de Monteag…     106    47     5       6       0     0     0
10 Almería   Alhabia                 409    95    35      23       0     0     1
11 Almería   Alhama de Almería      1984   402   103     182       0     0    18
12 Almería   Alicún                  150    38    11       7       0     0     3
13 Almería   Almería               92383 24495  7848    9699       0   204   964
14 Almería   Almócita                103    10     6      16       0     1     0
15 Almería   Alsodux                  81    16     1       9       0     3     3
# … with abbreviated variable names ¹​votos_validos, ²​podemos_iu, ³​mas_pais

Introduction

But …

Minor contracts granted in Barcelona

# A tibble: 15 × 5
   ens                        contracte       data       proveidor        import
   <chr>                      <chr>           <date>     <chr>             <dbl>
 1 CONSORCIS                  Serveis         NA         Marc Redorta      1646.
 2 CONSORCIS                  Serveis         NA         JESUS SANZ LOPEZ   360 
 3 CONSORCIS                  Serveis         NA         JOAN BOSCH MUNT…   690.
 4 CONSORCIS                  Serveis         NA         ALLIANZ, Compañ…   456.
 5 FUNDACIONS I ASSOCIACIONS  <NA>            NA         Vopi4Elecnor      1891.
 6 FUNDACIONS I ASSOCIACIONS  Serveis         2918-11-19 Jaume Badosa      1754.
 7 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         WHADS MEDIA STU…  1392.
 8 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         AJUNTAMENT BARC…  4655.
 9 ORGANISMES AUTÒNOMS LOCALS Subministrament NA         INADHOC HABITAT   9859.
10 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         DOT CONSULTING   18090.
11 ORGANISMES AUTÒNOMS LOCALS Subministrament NA         SAYTEL           15562.
12 ORGANISMES AUTÒNOMS LOCALS Subministrament NA         ID GRUP           1197.
13 ORGANISMES AUTÒNOMS LOCALS Subministrament NA         ID GRUP           2117.
14 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         CODINA, ESTHER    1600 
15 ORGANISMES AUTÒNOMS LOCALS Serveis         NA         ABELEIRA, ANTON…  1200 

Introduction

But …

Price of rents

# A tibble: 15 × 3
   nom_barri                              preu preu_m2
   <chr>                                 <dbl>   <dbl>
 1 el Raval                               590.   10.8 
 2 el Barri Gòtic                         713.   10.6 
 3 la Barceloneta                         541.   14.4 
 4 Sant Pere, Santa Caterina i la Ribera  673.   11.0 
 5 el Fort Pienc                          736.   10.4 
 6 la Sagrada Família                     673.   10.6 
 7 la Dreta de l'Eixample                 921.    9.84
 8 l'Antiga Esquerra de l'Eixample        828.   10.4 
 9 la Nova Esquerra de l'Eixample         716.   10.3 
10 Sant Antoni                            693.    9.77
11 el Poble Sec                           568    10.2 
12 la Marina del Prat Vermell              NA    NA   
13 la Marina de Port                      554.    8.34
14 la Font de la Guatlla                  632.   10.8 
15 Hostafrancs                            581.   10.4 

Income per capita

# A tibble: 15 × 3
   nom_barri    sc import_euros
   <chr>     <dbl>        <dbl>
 1 el Raval      1        27503
 2 el Raval      2        21913
 3 el Raval      3        24220
 4 el Raval      4        27405
 5 el Raval      5        23014
 6 el Raval      6        25581
 7 el Raval      7        25959
 8 el Raval      8        21900
 9 el Raval      9        24849
10 el Raval     10        23306
11 el Raval     11        21571
12 el Raval     12        25592
13 el Raval     13        27908
14 el Raval     14        26078
15 el Raval     15        26092

Introduction

Often, data is not tidy because it’s:

  • Messy: Variables are not represented in the columns.
  • Missing: Cells do not contain a value.
  • Separated: Observations are in many datasets.

Messy data

Tidy data?

Number of TB cases documented by WHO in Afghanistan, Brazil, and China between 1999 & 2000 (cases & population) (example from Wickham and Girlich (2022)).

Table A

# A tibble: 6 × 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Table B

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Table C

# A tibble: 12 × 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Table D

# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Pivoting data

We change the rows and columns of the dataframe keeping the same information.

  • Pivot longer: pivot_longer(df, cols, names_to, values_to)
  • Pivot wider: pivot_wider(df, names_from, values_from)
  • Separate: separate(df, col, into, sep)
  • Unite: unite(df, col, ..., sep)

Why data is not tidy?

Three reasons, among others:

  • Saves space (see IGO dataset)
  • Spreadsheet mindset.
  • Tidyverse revolution1.

Missing data

What is missing data?

Zero vs. NA (Not Available)

Why is data missing?

Special categories: People do not answer.

# A tibble: 1,599 × 4
   age   language       situation_country    situation_personal  
   <chr> <chr>          <chr>                <chr>               
 1 26    Igbo           Very Bad             Fairly Good         
 2 25    Other          Very Bad             Fairly Bad          
 3 35    Hausa          Fairly Good          Very good           
 4 79    Other          Neither good nor bad Neither good nor bad
 5 19    English        Very Bad             Fairly Good         
 6 34    Igbo           Don't know           Don't know          
 7 30    Pidgin English Very Bad             Very Bad            
 8 32    Hausa          Fairly Good          Very good           
 9 50    Other          Neither good nor bad Neither good nor bad
10 18    English        Very Bad             Neither good nor bad
# … with 1,589 more rows

Why is data missing?

Special categories: Hard to qualify.

# A tibble: 28 × 6
   ccode scode country       year polity polity2
   <dbl> <chr> <chr>        <dbl>  <dbl>   <dbl>
 1   560 SAF   South Africa  1991      5       5
 2   560 SAF   South Africa  1992    -88       6
 3   560 SAF   South Africa  1993    -88       8
 4   560 SAF   South Africa  1994      9       9
 5   560 SAF   South Africa  1995      9       9
 6   560 SAF   South Africa  1996      9       9
 7   560 SAF   South Africa  1997      9       9
 8   560 SAF   South Africa  1998      9       9
 9   560 SAF   South Africa  1999      9       9
10   560 SAF   South Africa  2000      9       9
# … with 18 more rows

Why is data missing?

Not known (NA) or strange/extreme value.

# A tibble: 7 × 8
    any  trim codi_dist districte   codi_barri nom_barri      preu preu_m2
  <dbl> <dbl>     <dbl> <chr>            <dbl> <chr>         <dbl>   <dbl>
1  2014     1         8 Nou Barris          47 Can Peguera    242.    6.03
2  2014     2         9 Sant Andreu         58 Baró de Viver  231.    4.03
3  2014     3         8 Nou Barris          47 Can Peguera    145.    3.54
4  2015     3         9 Sant Andreu         58 Baró de Viver  187     4   
5  2016     1         9 Sant Andreu         58 Baró de Viver  142.    3.18
6  2016     3         8 Nou Barris          54 Torre Baró       0    NA   
7  2016     3         9 Sant Andreu         58 Baró de Viver  256.    3.85

Why is data missing?

Data join typically produces NA.

  • See next chapter

Dealing with missing data

Some functions:

  • summary() for exploration.
  • is.na() is useful when filtering.
  • na.rm = TRUE argument in summary functions.
  • replace_na() assigns a value to NAs.
  • naniar::vis_miss() (be careful!!)

Dealing with missing data

The real problem is when the reason why data is missing is related to what we are studying:

  • Why students don’t come to class?
  • Why is there poverty in the world?

Join data

Join data

xxxx_join(df1, 
          df2, 
          by = c("join_variable_df1" = "join_variable_df2"))
  • full_join()
  • left_join()
  • right_join()
  • inner_join()

Join data

Classification territorial units & parties

Codebooks

Codebooks

Examples:

  • CHES dataset: Web
  • UCDP: Web
  • World Bank wbstats package.

Bibliography

Wickham, Hadley. 2014. Tidy Data.” Journal of Statistical Software 50 (10): 1–23.
Wickham, Hadley, and Maximilian Girlich. 2022. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.